In this EDA, I explore a dataset of airline on-time performance to find insights into flight delays and cancellations. The dataset is very large: it consists of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008, with over 120 million observations (flights). The data is compressed into individual CSV files, one for each year.
I chose to explore this particular dataset because it would allow me to learn new skills and optimization techniques for handling large datasets.
Due to the size of this dataset, it would be very difficult to load the data into a Pandas dataframe in memory without reducing it to a very small subset, so I decided to use R Markdown (instead of a Jupyter notebook) so that I can use R packages, along with SQL queries, to wrangle the data into a more summarized format that a Pandas dataframe can handle.
library(tidyverse)
library(skimr)
library(dplyr)
library(here) # To locate files based on current working directory
library(janitor) # Tools for examining and cleaning dirty data.
library(reticulate) # For reading R objects in Python
library(data.table) # For reading large datasets efficiently
library(inborutils) # For reading CSV files and converting to SQL
library(DBI) # Interface to connect with SQL databases
library(RSQLite) # For connecting with SQL databases

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pathlib
import os

This dataset contains 22 large CSV files of flight data, one for each year from 1987 to 2008, as well as some other CSV files which contain extra information. I will not directly read any of the large CSV files because it would take too much memory. I will read in the other CSV files, then use R libraries and SQL queries to read in smaller samples of the data to explore.
path_column_data = "airline/airline_dataset_column_info.csv"
path_airport_data = "airline/other_data/airports.csv"
path_carrier_data = "airline/other_data/carriers.csv"
path_plane_data = "airline/other_data/plane-data.csv"
path_main = "airline/main_data" # Path to the main (yearly) CSV files
def absolute_file_paths(directory):
    data = {}
    files = os.listdir(directory)
    paths = [f"{directory}/{file}" for file in files]  # Use the function argument, not the global
    for idx in range(len(files)):
        name = files[idx].split('.')[0]
        data[name] = paths[idx]
    return data
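As a side note, the same mapping could be built more compactly with the pathlib module imported earlier; a quick sketch (not the version used below):

def absolute_file_paths_alt(directory):
    # Map each CSV's base name (e.g. "2007") to its path
    return {p.stem: str(p) for p in pathlib.Path(directory).glob("*.csv")}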
main_files = absolute_file_paths(path_main)

# Data on column descriptions for the main files
column_data = pd.read_csv(path_column_data)
# Data on different airports
airport_data = pd.read_csv(path_airport_data)
# Airline companies
carrier_data = pd.read_csv(path_carrier_data)
# Plane data, specifications and other info
plane_data = pd.read_csv(path_plane_data)

# Info on column descriptions for the main files
column_data <- read.csv(py$path_column_data)
# Data on different airports
airport_data <- read.csv(py$path_airport_data)
# Information on airline companies
carrier_data <- read.csv(py$path_carrier_data)
# Plane data, specifications and other info
plane_data <- read.csv(py$path_plane_data)

For the main data, I have written a script that reads the data in smaller chunks and stores them in a database file (SQLite). Each year's data is stored in its own table. I also store the other data in their own tables so that later, when needed, I can reference them using SQL joins. The process takes a while to run because of the large dataset (over 30 minutes on my PC).
Also, the original files are named by the year they represent. It is not good practice to name a database table starting with a number, so the script adds a prefix to each name.
For now, I have added a condition so the code will only be executed if the sqlite file is not detected in the project root directory.
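For reference, the same chunked CSV-to-SQLite idea can be sketched in plain pandas (an illustration only, assuming the yearly files are named like 2007.csv; the R script below with inborutils::csv_to_sqlite is what actually runs):

import sqlite3
con = sqlite3.connect("airline_data.sqlite")
# Stream one yearly file in 50,000-row chunks so it never sits fully in memory
for chunk in pd.read_csv("airline/main_data/2007.csv", chunksize=50_000):
    chunk.to_sql("table_2007", con, if_exists="append", index=False)
con.close()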
path_main <- "airline/main_data"
db_file <- "airline_data.sqlite"
save_in_sql <- function() {
main_files <- list.files(path = path_main, full.names = TRUE)
if (!file.exists(db_file)) {
# Creating the airport data table
inborutils::csv_to_sqlite(
csv_file = py$path_airport_data,
table_name = "airports",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the carrier data table
inborutils::csv_to_sqlite(
csv_file = py$path_carrier_data,
table_name = "carriers",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the plane data table
inborutils::csv_to_sqlite(
csv_file = py$path_plane_data,
table_name = "planes",
sqlite_file = db_file,
show_progress_bar = FALSE)
# Creating the tables for each of the years' data
for (csv in main_files) {
csv_name <- strsplit(csv, "/|[.]")[[1]] # Splitting the csv name by "/" or "."
csv_name <- csv_name[length(csv_name)-1] # Getting the second last element of the list
table_name <- paste("table", csv_name, sep="_")
print("Updating table: %s", table_name)
inborutils::csv_to_sqlite(
csv_file = csv,
sqlite_file = db_file,
table_name = table_name,
pre_process_size = 1000,
chunk_size = 50000,
show_progress_bar = TRUE)
}
}
}
save_in_sql()

Now I inspect the database file to be sure that all tables have been added and updated properly.
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
db_tables <- dbListTables(airline_db) # List out the tables in the db
print(db_tables)

## [1] "airports"     "carriers"     "planes"       "sqlite_stat1" "sqlite_stat4"
## [6] "table_1987" "table_1988" "table_1989" "table_1990" "table_1991"
## [11] "table_1992" "table_1993" "table_1994" "table_1995" "table_1996"
## [16] "table_1997" "table_1998" "table_1999" "table_2000" "table_2001"
## [21] "table_2002" "table_2003" "table_2004" "table_2005" "table_2006"
## [26] "table_2007" "table_2008"
db_1993_cols <- dbListFields(airline_db, "table_1993") # Column names for specific table in db
print(length(db_1993_cols))

## [1] 29
We can see from the result above that there are 29 columns in this table, and the same is true across all the yearly tables; they all have the same columns. However, we don't know yet exactly how many rows are in each table.
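Before moving on to row counts, here is a quick sanity check (my addition, not in the original analysis) that every yearly table really has the same 29 columns, using Python's built-in sqlite3 module:

import sqlite3
con = sqlite3.connect("airline_data.sqlite")
for year in range(1987, 2009):
    cols = pd.read_sql_query(f"PRAGMA table_info(table_{year})", con)
    assert len(cols) == 29, f"table_{year} has {len(cols)} columns"
con.close()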
The code below is a script/query to return exactly the number of rows (observations) that are in each table. The query can take a few minutes to execute the first time.
# count_rows <- function() {
# Table = character() # Empty vector/list to store table names
# Row_Count = integer() # Empty vector/list to store row counts
#
# for (table in db_tables) {
# query_rows <- sprintf("SELECT COUNT(*) AS Rows FROM %s", table)
# row_count <- dbGetQuery(airline_db, query_rows)[[1]]
#
# Table <- c(Table, table) # Appending each table name to the vector
# Row_Count <- c(Row_Count, row_count) # Appending each row count to the vector
# }
#
# df_row_count <- data.frame(Table, Row_Count)
# return(df_row_count)
# }
# table_row_count <- count_rows()
# table_row_count

We can now see the number of rows in each table, which sums to over 120 million observations. To test the SQL connection, I load the first 500 rows of data from a particular year (the 2005 dataset in this case) using SQL and the R interface.
query_test <- "SELECT * FROM table_2005 LIMIT 500"
tbl(airline_db, sql(query_test)) # Runs the query and displays results without loading them in memory

## # Source:   SQL [?? x 29]
## # Database: sqlite 3.39.2 [O:\GitHub\data_analysis\alx\airline_performance_analysis\airline_data.sqlite]
## Year Month DayofMo…¹ DayOf…² DepTime CRSDe…³ ArrTime CRSAr…⁴ Uniqu…⁵ Fligh…⁶
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2005 1 28 5 1603 1605 1741 1759 UA 541
## 2 2005 1 29 6 1559 1605 1736 1759 UA 541
## 3 2005 1 30 7 1603 1610 1741 1805 UA 541
## 4 2005 1 31 1 1556 1605 1726 1759 UA 541
## 5 2005 1 2 7 1934 1900 2235 2232 UA 542
## 6 2005 1 3 1 2042 1900 9 2232 UA 542
## 7 2005 1 4 2 2046 1900 2357 2232 UA 542
## 8 2005 1 5 3 NA 1900 NA 2232 UA 542
## 9 2005 1 6 4 2110 1900 8 2223 UA 542
## 10 2005 1 7 5 1859 1900 2235 2223 UA 542
## # … with more rows, 19 more variables: TailNum <chr>, ActualElapsedTime <dbl>,
## # CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## # Origin <chr>, Dest <chr>, Distance <dbl>, TaxiIn <dbl>, TaxiOut <dbl>,
## # Cancelled <dbl>, CancellationCode <chr>, Diverted <dbl>,
## # CarrierDelay <dbl>, WeatherDelay <dbl>, NASDelay <dbl>,
## # SecurityDelay <dbl>, LateAircraftDelay <dbl>, and abbreviated variable
## # names ¹DayofMonth, ²DayOfWeek, ³CRSDepTime, ⁴CRSArrTime, ⁵UniqueCarrier, …
top_rows <- dbGetQuery(airline_db, query_test) # Runs the query and stores it in a dataframe, in memory
dbDisconnect(airline_db) # Disconnect from the database when done

Now let's look at some summary statistics for the data. I will read in the first 5,000 rows of the 2007 flight data.
top_2007 = pd.read_csv(main_files['2007'], nrows=5000)
top_2007.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5000 entries, 0 to 4999
## Data columns (total 29 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Year 5000 non-null int64
## 1 Month 5000 non-null int64
## 2 DayofMonth 5000 non-null int64
## 3 DayOfWeek 5000 non-null int64
## 4 DepTime 4964 non-null float64
## 5 CRSDepTime 5000 non-null int64
## 6 ArrTime 4963 non-null float64
## 7 CRSArrTime 5000 non-null int64
## 8 UniqueCarrier 5000 non-null object
## 9 FlightNum 5000 non-null int64
## 10 TailNum 5000 non-null object
## 11 ActualElapsedTime 4963 non-null float64
## 12 CRSElapsedTime 5000 non-null int64
## 13 AirTime 4963 non-null float64
## 14 ArrDelay 4963 non-null float64
## 15 DepDelay 4964 non-null float64
## 16 Origin 5000 non-null object
## 17 Dest 5000 non-null object
## 18 Distance 5000 non-null int64
## 19 TaxiIn 5000 non-null int64
## 20 TaxiOut 5000 non-null int64
## 21 Cancelled 5000 non-null int64
## 22 CancellationCode 36 non-null object
## 23 Diverted 5000 non-null int64
## 24 CarrierDelay 5000 non-null int64
## 25 WeatherDelay 5000 non-null int64
## 26 NASDelay 5000 non-null int64
## 27 SecurityDelay 5000 non-null int64
## 28 LateAircraftDelay 5000 non-null int64
## dtypes: float64(6), int64(18), object(5)
## memory usage: 1.1+ MB
Most of the columns are numeric, some indicating arrival and departure times, as well as different causes of delays. There are some binary columns, such as "Cancelled" and "Diverted", which are important variables to analyze.
For this EDA, I am interested in exploring some of the ideas suggested on the source website. Generally, I am interested in exploring the causes of flight delays and cancellations.
For this investigation, I will get the best insights by focusing on the "delay" columns. By analyzing the delays on each day of the week and in each month, I believe I can get a good idea of the best times to fly. I will first explore data for a single year, then compare it with the other years to see if similar patterns hold.
I will start by analyzing the departure delays: DepDelay. I will be using the 2007 dataset. Because the data is so large, I will only read some select columns into the dataframe.
df_2007 = pd.read_csv(main_files['2007'], usecols = ['Month', 'DayofMonth', 'ArrDelay', 'DepDelay', 'Distance'], nrows = 2_000_000)
df_2007.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000000 entries, 0 to 1999999
## Data columns (total 5 columns):
## # Column Dtype
## --- ------ -----
## 0 Month int64
## 1 DayofMonth int64
## 2 ArrDelay float64
## 3 DepDelay float64
## 4 Distance int64
## dtypes: float64(2), int64(3)
## memory usage: 76.3 MB
df_2007.describe()

##               Month    DayofMonth      ArrDelay      DepDelay      Distance
## count 2.000000e+06 2.000000e+06 1.935147e+06 1.939160e+06 2.000000e+06
## mean 2.182453e+00 1.555414e+01 1.050200e+01 1.184517e+01 7.000558e+02
## std 9.717774e-01 8.695236e+00 3.906322e+01 3.582602e+01 5.458290e+02
## min 1.000000e+00 1.000000e+00 -3.120000e+02 -3.050000e+02 1.100000e+01
## 25% 1.000000e+00 8.000000e+00 -9.000000e+00 -4.000000e+00 3.080000e+02
## 50% 2.000000e+00 1.600000e+01 0.000000e+00 0.000000e+00 5.460000e+02
## 75% 3.000000e+00 2.300000e+01 1.500000e+01 1.200000e+01 9.300000e+02
## max 4.000000e+00 3.100000e+01 1.564000e+03 1.547000e+03 4.962000e+03
# Check for missing values
df_2007['DepDelay'].isnull().sum()

## 60840
df_2007[df_2007.ArrDelay.notnull()]

##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 0 1 1 1.0 7.0 389
## 1 1 1 8.0 13.0 479
## 2 1 1 34.0 36.0 479
## 3 1 1 26.0 30.0 479
## 4 1 1 -3.0 1.0 479
## ... ... ... ... ... ...
## 1999995 4 30 22.0 0.0 741
## 1999996 4 30 48.0 50.0 1024
## 1999997 4 30 -12.0 -1.0 646
## 1999998 4 30 12.0 20.0 589
## 1999999 4 30 -21.0 -5.0 461
##
## [1935147 rows x 5 columns]
# Changing column data types to reduce memory usage
df_2007 = df_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'ArrDelay':'float32', 'DepDelay':'float32'})

binsize = 30
bins = np.arange(0, df_2007['DepDelay'].max()+binsize, binsize)
plt.figure(figsize=[14, 8])
plt.hist(data = df_2007, x = 'DepDelay', bins = bins)

## (histogram of DepDelay; full array of bin counts and edges omitted)
plt.xlabel('Departure Delays (mins)')
plt.show()

The distribution is skewed to the right, with a long tail. The large majority of the data falls within the range of 0 to 250 minutes. I would have gone for a logarithmic scale, but this data has negative values (because there are flights that took off before the expected departure time). There are also some missing values.
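One possible workaround for the negative values (a sketch on my part, not used in this analysis) is matplotlib's 'symlog' scale, which is linear around zero and logarithmic further out (the linthresh argument assumes a reasonably recent matplotlib):

plt.figure(figsize=[14, 8])
plt.hist(df_2007['DepDelay'].dropna(), bins=100)
plt.xscale('symlog', linthresh=10)  # linear within +/- 10 minutes, logarithmic beyond
plt.xlabel('Departure Delays (mins)')
plt.show()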
# Flights that took off over 25 minutes earlier
df_2007[df_2007.DepDelay < -25].info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 165 entries, 99831 to 1977687
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 165 non-null int8
## 1 DayofMonth 165 non-null int8
## 2 ArrDelay 163 non-null float32
## 3 DepDelay 165 non-null float32
## 4 Distance 165 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 4.2 KB
# Lowest departure delay
df_2007[df_2007.DepDelay < 0].DepDelay.min()

## -305.0
We can see that many flights took off before their expected departure time (almost 50% of all the flights in that year). That is not unusual, especially if it falls within a few minutes and all passengers are on board, but there are many flights that took off unusually early (over 30 minutes early, even up to 5 hours early). There can be many reasons for this, but since this exploration is mainly focused on delay times and there are so many records to work with, I will only assess flights that were actually delayed.
df_2007_delayed = df_2007[df_2007.DepDelay > 0]
df_2007_delayed.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 867613 entries, 0 to 1999998
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 867613 non-null int8
## 1 DayofMonth 867613 non-null int8
## 2 ArrDelay 865260 non-null float32
## 3 DepDelay 867613 non-null float32
## 4 Distance 867613 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 21.5 MB
df_2007_delayed.describe()

##                Month     DayofMonth  ...       DepDelay       Distance
## count 867613.000000 867613.000000 ... 867613.000000 867613.000000
## mean 2.199465 15.268106 ... 31.058889 732.023085
## std 0.961845 8.696432 ... 46.776241 554.655358
## min 1.000000 1.000000 ... 1.000000 11.000000
## 25% 1.000000 8.000000 ... 5.000000 328.000000
## 50% 2.000000 15.000000 ... 15.000000 587.000000
## 75% 3.000000 22.000000 ... 38.000000 967.000000
## max 4.000000 31.000000 ... 1547.000000 4962.000000
##
## [8 rows x 5 columns]
Now I will try to plot using a log scale, this time looking at the arrival delays.
log_binsize = 0.025
bins = 10 ** np.arange(0, np.log10(df_2007_delayed['ArrDelay'].max())+log_binsize, log_binsize)
plt.figure(figsize=[14, 8])
plt.hist(data = df_2007_delayed, x = 'ArrDelay', bins = bins)

## (histogram of ArrDelay on log-spaced bins; full array of bin counts and edges omitted)
plt.xscale('log')
# plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4], [500, '1k', '2k', '5k', '10k', '20k'])
plt.xlabel('Arrival Delays (mins)')
plt.show()

Next, I will look at the delay columns in relation to distance. Specifically, I will focus on the arrival delay and its relationship with distance: I want to see if longer flights tend to have longer delays.
# Dropping rows with missing values
print("Original rows and columns =",df_2007.shape)## Original rows and columns = (2000000, 5)
df_2007_sampled = df_2007.dropna(subset=['ArrDelay', 'Distance']).sample(n=20000, replace = False)
print("Sampled rows and columns =",df_2007_sampled.shape)## Sampled rows and columns = (20000, 5)
print(df_2007_sampled.info())## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 20000 entries, 841926 to 869387
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Month 20000 non-null int8
## 1 DayofMonth 20000 non-null int8
## 2 ArrDelay 20000 non-null float32
## 3 DepDelay 20000 non-null float32
## 4 Distance 20000 non-null int64
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 507.8 KB
## None
print(df_2007_sampled.head())

##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 841926 2 19 50.0 50.0 862
## 1887239 4 20 -1.0 7.0 389
## 1991294 4 16 -8.0 0.0 64
## 507955 1 2 16.0 -6.0 1236
## 371612 1 20 -1.0 13.0 1946
plt.figure(figsize=[12, 12])
plt.scatter(data=df_2007_sampled, x='Distance', y='ArrDelay')
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()

There seems to be very little correlation from this plot. Since there are a lot of overlapping points, I will apply some transparency to get a better picture.
plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()

Most of the flights had distances under 2,500 miles and delays under 400 minutes. Again, I will remove the negative values, since I am only interested in flights that arrived later than expected.
df_2007_sampled = df_2007_sampled.query("0 < ArrDelay < 200")
plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.show();

Many of the flights with shorter distances seemed to have shorter delays, since the density of the dots fades further from the 0 mark. Let's try to get a different picture with another plot.
x_bins = np.arange(0, 4500+100, 100)
y_bins = np.arange(0, 200+7, 7)
plt.figure(figsize=[12, 12])
plt.hist2d(data=df_2007_sampled, x='Distance', y='ArrDelay', cmin=0.3, cmap='viridis_r', bins = [x_bins, y_bins]);
plt.show();
It's still the same story from what I can see here.
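To put a number on that impression (my addition, not part of the original output), the Pearson correlation can be computed directly:

corr = df_2007_sampled['Distance'].corr(df_2007_sampled['ArrDelay'])
print(f"Pearson r between Distance and ArrDelay: {corr:.3f}")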
Next, I will analyze the delays on a day-of-week basis.
get_mean_delay_year <- function(year) {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
# Query for mean delay times per day for the given year (actual delays only, no early flights)
query_delays <- sprintf("SELECT DayOfWeek, DayOfMonth,
AVG(DepDelay) AS MeanDepDelay,
AVG(ArrDelay) AS MeanArrDelay
FROM table_%s
WHERE DepDelay > 0 AND ArrDelay > 0
GROUP BY DayOfWeek, DayOfMonth", year)
# tbl(airline_db, sql(query_delays)) # Runs the query and displays results without loading them in memory
delays <- dbGetQuery(airline_db, query_delays) # Runs the query and stores the result in a dataframe, in memory
dbDisconnect(airline_db) # Disconnect from the database when done
return(delays)
}

delays_2007 <- get_mean_delay_year('2007')
head(delays_2007)

##   DayOfWeek DayofMonth MeanDepDelay MeanArrDelay
## 1 1 1 35.50271 35.54895
## 2 1 2 40.18925 41.42904
## 3 1 3 38.94795 39.51203
## 4 1 4 46.83710 49.46875
## 5 1 5 36.06113 37.46486
## 6 1 6 42.34605 44.32375
skim(delays_2007)

| Name | delays_2007 |
| Number of rows | 216 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| DayOfWeek | 0 | 1 | 3.99 | 2.00 | 1.00 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
| DayofMonth | 0 | 1 | 15.93 | 8.93 | 1.00 | 8.00 | 16.00 | 24.00 | 31.00 | ▇▇▇▇▇ |
| MeanDepDelay | 0 | 1 | 38.37 | 6.48 | 24.67 | 33.25 | 37.68 | 41.89 | 61.84 | ▃▇▆▁▁ |
| MeanArrDelay | 0 | 1 | 40.66 | 7.88 | 24.93 | 34.60 | 39.75 | 45.29 | 70.32 | ▅▇▅▁▁ |
For the DayOfWeek data, I will make another column in the dataframe that shows the text representation (Monday, Tuesday, …) so that it is easier to understand in the plot.
def change_column_type(df):
# Converting R dataframe to Pandas dataframe
df_delays = pd.DataFrame(df)
# Changing day and month columns from float to integer data types
df_delays = df_delays.astype({'DayOfWeek':'int8', 'DayofMonth':'int8'})
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Creating the new column
df_delays['Day'] = df_delays['DayOfWeek'].apply(lambda x: days_of_week[x-1])
return(df_delays)
delays_2007 = change_column_type(r.delays_2007)
delays_2007.head()

##    DayOfWeek  DayofMonth  MeanDepDelay  MeanArrDelay     Day
## 0 1 1 35.502709 35.548949 Monday
## 1 1 2 40.189248 41.429041 Monday
## 2 1 3 38.947953 39.512035 Monday
## 3 1 4 46.837098 49.468746 Monday
## 4 1 5 36.061127 37.464863 Monday
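To guarantee that the weekdays always plot in Monday-to-Sunday order, the Day column could also be converted to an ordered categorical using the CategoricalDtype imported at the start (a sketch; the plots below simply rely on the rows already being in order):

day_type = CategoricalDtype(categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                        'Friday', 'Saturday', 'Sunday'], ordered=True)
delays_2007['Day'] = delays_2007['Day'].astype(day_type)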
plt.figure(figsize=[16,12])
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay')
plt.title("Average Flight Delay Times In 2007")
plt.xlabel("Day of Week")
plt.ylabel("Average Departure Delay (min)")
plt.show();

From the above chart alone, the average delay times for each weekday seem fairly similar. I will see if there is a consistent theme across the years by plotting the chart for 9 consecutive years (2000 - 2008).
I kept the default colors because, even though the weekday variable is ordinal (i.e. Tuesday comes after Monday and so on), the order doesn't matter much in this case: for example, Friday is not "better" than Sunday, and Monday is not "higher" than Saturday. The colors will be helpful in identifying each weekday in the subsequent plots.
delays_2000 <- get_mean_delay_year('2000')
delays_2001 <- get_mean_delay_year('2001')
delays_2002 <- get_mean_delay_year('2002')
delays_2003 <- get_mean_delay_year('2003')
delays_2004 <- get_mean_delay_year('2004')
delays_2005 <- get_mean_delay_year('2005')
delays_2006 <- get_mean_delay_year('2006')
delays_2007 <- get_mean_delay_year('2007')
delays_2008 <- get_mean_delay_year('2008')

delays_2000 = change_column_type(r.delays_2000)
delays_2001 = change_column_type(r.delays_2001)
delays_2002 = change_column_type(r.delays_2002)
delays_2003 = change_column_type(r.delays_2003)
delays_2004 = change_column_type(r.delays_2004)
delays_2005 = change_column_type(r.delays_2005)
delays_2006 = change_column_type(r.delays_2006)
delays_2007 = change_column_type(r.delays_2007)
delays_2008 = change_column_type(r.delays_2008)

fig, ax = plt.subplots(ncols = 3, nrows = 3, figsize = [17,17])
sns.barplot(data=delays_2000, x='Day', y='MeanDepDelay', ax = ax[0, 0])
ax[0, 0].set_title("2000")
sns.barplot(data=delays_2001, x='Day', y='MeanDepDelay', ax = ax[1, 0])
ax[1, 0].set_title("2001")
sns.barplot(data=delays_2002, x='Day', y='MeanDepDelay', ax = ax[2, 0])
ax[2, 0].set_title("2002")
sns.barplot(data=delays_2003, x='Day', y='MeanDepDelay', ax = ax[0, 1])
ax[0, 1].set_title("2003")
sns.barplot(data=delays_2004, x='Day', y='MeanDepDelay', ax = ax[1, 1])
ax[1, 1].set_title("2004")
sns.barplot(data=delays_2005, x='Day', y='MeanDepDelay', ax = ax[2, 1])
ax[2, 1].set_title("2005")
sns.barplot(data=delays_2006, x='Day', y='MeanDepDelay', ax = ax[0, 2])
ax[0, 2].set_title("2006")
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay', ax = ax[1, 2])
ax[1, 2].set_title("2007")
sns.barplot(data=delays_2008, x='Day', y='MeanDepDelay', ax = ax[2, 2])
ax[2, 2].set_title("2008")
plt.show();

From an overview of the charts above, there is no consistent trend to predict which weekdays have more delays. This is expected, because there are many other factors to consider, like the month, the season, holidays, airport, carrier, plane age and global events.
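As an aside, the nine repeated conversion and plotting blocks above could be collapsed into a loop; a sketch (assuming attribute access on reticulate's r object works through getattr, and using axes.T.flat to fill the grid column by column as above):

years = range(2000, 2009)
fig, axes = plt.subplots(ncols=3, nrows=3, figsize=[17, 17])
for year, axis in zip(years, axes.T.flat):
    df = change_column_type(getattr(r, f"delays_{year}"))
    sns.barplot(data=df, x='Day', y='MeanDepDelay', ax=axis)
    axis.set_title(str(year))
plt.show()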
Now I will examine the cancelled and diverted flights, relative to the carriers (airline companies). I want to see if flights from some carriers tend to get cancelled or diverted more than others.
Below is an SQL query to get all flights that were either diverted or cancelled, grouped by the flight carrier. In the original table, there is a column for diverted (1 or 0) and another column for cancelled (1 or 0). I believe the data is not completely tidy, because a flight that is cancelled cannot be diverted and vice versa, so I combined them into a single column indicating whether the flight was cancelled or diverted.
get_changed_flights <- function(year) {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
query <- sprintf("SELECT
UniqueCarrier,
Description AS Carrier,
CASE
WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
END AS FlightStatus,
COUNT(*) AS Flights
FROM table_%s
LEFT JOIN carriers
ON table_%s.UniqueCarrier = carriers.Code
WHERE Diverted = 1 OR Cancelled = 1
GROUP BY UniqueCarrier, Carrier, FlightStatus
ORDER BY UniqueCarrier", year, year)
flights_changed <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory
dbDisconnect(airline_db) # Disconnect from the database when done
return(flights_changed)
}
flights_changed_2007 <- get_changed_flights('2007')
head(flights_changed_2007, 10)

##    UniqueCarrier                Carrier FlightStatus Flights
## 1 9E Pinnacle Airlines Inc. Cancelled 7939
## 2 9E Pinnacle Airlines Inc. Diverted 726
## 3 AA American Airlines Inc. Cancelled 17924
## 4 AA American Airlines Inc. Diverted 2097
## 5 AQ Aloha Airlines Inc. Cancelled 388
## 6 AQ Aloha Airlines Inc. Diverted 15
## 7 AS Alaska Airlines Inc. Cancelled 2563
## 8 AS Alaska Airlines Inc. Diverted 499
## 9 B6 JetBlue Airways Cancelled 3710
## 10 B6 JetBlue Airways Diverted 624
Let's see the number of cancelled and diverted flights in this data (the 2007 data).
flights_changed_2007 = pd.DataFrame(r.flights_changed_2007)
flights_changed_2007.head()

##   UniqueCarrier                Carrier FlightStatus  Flights
## 0 9E Pinnacle Airlines Inc. Cancelled 7939
## 1 9E Pinnacle Airlines Inc. Diverted 726
## 2 AA American Airlines Inc. Cancelled 17924
## 3 AA American Airlines Inc. Diverted 2097
## 4 AQ Aloha Airlines Inc. Cancelled 388
print(f"Total number of diverted or cancelled flights in 2007: {flights_changed_2007.Flights.sum()}")## Total number of diverted or cancelled flights in 2007: 177927
plt.figure(figsize=[16,12])
sns.barplot(data = flights_changed_2007, x = 'UniqueCarrier', y = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()

We can already see that MQ had the highest number of cancelled flights (by a relatively wide margin). To better understand the plot, I will make it horizontal and order it by number of cancelled flights.
# Sort by descending order of number of flights
flights_changed_2007 = flights_changed_2007.sort_values(by='Flights', ascending=False)
plt.figure(figsize=[20,14])
sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()

It is now easier to see which airline companies had the most cancelled and diverted flights; MQ leads the cancellations by a wide margin. So we can at least judge that the carriers at the bottom of the chart have a good record (though I am aware that the chart could look like this simply because the companies at the bottom do not operate as many flights as those at the top).
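One way to control for fleet size (not done here, but sketched as a follow-up idea) would be to compare cancellation and diversion rates rather than raw counts, since the average of a 0/1 flag is the share of flights affected:

import sqlite3
con = sqlite3.connect("airline_data.sqlite")
rates = pd.read_sql_query(
    """SELECT UniqueCarrier,
              AVG(Cancelled) AS CancelRate,
              AVG(Diverted)  AS DivertRate
       FROM table_2007
       GROUP BY UniqueCarrier
       ORDER BY CancelRate DESC""", con)
con.close()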
Now I'm going to plot the chart for the most recent 6 years of the dataset, to see if this trend is the same across the years, i.e., to see if the same companies are always on top.
flights_changed_2003 = get_changed_flights('2003')
flights_changed_2004 = get_changed_flights('2004')
flights_changed_2005 = get_changed_flights('2005')
flights_changed_2006 = get_changed_flights('2006')
flights_changed_2007 = get_changed_flights('2007')
flights_changed_2008 = get_changed_flights('2008')

flights_changed_2003 = pd.DataFrame(r.flights_changed_2003).sort_values(by='Flights', ascending=False)
flights_changed_2004 = pd.DataFrame(r.flights_changed_2004).sort_values(by='Flights', ascending=False)
flights_changed_2005 = pd.DataFrame(r.flights_changed_2005).sort_values(by='Flights', ascending=False)
flights_changed_2006 = pd.DataFrame(r.flights_changed_2006).sort_values(by='Flights', ascending=False)
flights_changed_2007 = pd.DataFrame(r.flights_changed_2007).sort_values(by='Flights', ascending=False)
flights_changed_2008 = pd.DataFrame(r.flights_changed_2008).sort_values(by='Flights', ascending=False)

plt.figure(figsize=[16,12])
base_color = sns.color_palette()[2]
plt.suptitle("2003 to 2006")
plt.subplot(3, 2, 1)
sns.barplot(data = flights_changed_2003, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2003")
plt.subplot(3, 2, 2)
sns.barplot(data = flights_changed_2004, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2004")
plt.subplot(3, 2, 3)
sns.barplot(data = flights_changed_2005, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2005")
plt.subplot(3, 2, 4)
sns.barplot(data = flights_changed_2006, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2006")
plt.subplot(3, 2, 5)
sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2007")
plt.subplot(3, 2, 6)
sns.barplot(data = flights_changed_2008, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2008")
# Adding an invisible full-figure axes so the common labels apply to the whole grid,
# not just the last subplot
big_ax = plt.gcf().add_subplot(111, frameon=False)
# Hiding the axis ticks and tick labels of the bigger plot
big_ax.tick_params(labelcolor="none", bottom=False, left=False)
# Adding the x-axis and y-axis labels for the bigger plot
big_ax.set_xlabel('Number of Flights', fontsize=15, fontweight='bold')
big_ax.set_ylabel('Carrier', fontsize=15, fontweight='bold')
plt.show()

We can see that largely the same companies are at the top each year.
Next, I want to analyze the number of flights that did not arrive (cancelled or diverted) from a general perspective. With the level of advancement in technology, I expect to see a relative reduction in the percentage of such flights each year.
The following query takes over 20 minutes to run, so I saved the results in a CSV file after running it once; that way I don't have to rerun it unless something changes.
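As an aside, the 22 repeated SELECT statements in the function below could also be generated programmatically instead of passing s to sprintf 22 times; a sketch of the same idea in Python:

select = ("SELECT Year, "
          "CASE WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted' "
          "WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled' "
          "ELSE 'Arrived' END AS FlightStatus, "
          "COUNT(*) AS Flights FROM table_{y} GROUP BY Year, FlightStatus")
query = "\nUNION ALL\n".join(select.format(y=y) for y in range(1987, 2009))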
get_all_flights <- function() {
airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
s <- "Year,
CASE
WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
ELSE 'Arrived'
END AS FlightStatus,
COUNT(*) AS Flights"
query <- sprintf("
SELECT %s FROM table_1987 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1988 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1989 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1990 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1991 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1992 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1993 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1994 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1995 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1996 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1997 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1998 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_1999 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2000 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2001 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2002 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2003 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2004 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2005 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2006 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2007 GROUP BY Year, FlightStatus
UNION ALL
SELECT %s FROM table_2008 GROUP BY Year, FlightStatus",
s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s)
all_flights <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory
dbDisconnect(airline_db) # Disconnect from the database when done
return(all_flights)
}
# all_flights <- get_all_flights()
# write.csv(all_flights, "all_flights.csv", row.names = FALSE)

all_flights <- read.csv('all_flights.csv')
all_flights

##    Year FlightStatus Flights
## 1 1987 Arrived 1288326
## 2 1987 Cancelled 19685
## 3 1987 Diverted 3815
## 4 1988 Arrived 5137497
## 5 1988 Cancelled 50163
## 6 1988 Diverted 14436
## 7 1989 Arrived 4952196
## 8 1989 Cancelled 74165
## 9 1989 Diverted 14839
## 10 1990 Arrived 5202481
## 11 1990 Cancelled 52458
## 12 1990 Diverted 15954
## 13 1991 Arrived 5020835
## 14 1991 Cancelled 43505
## 15 1991 Diverted 12585
## 16 1992 Arrived 5027937
## 17 1992 Cancelled 52836
## 18 1992 Diverted 11384
## 19 1993 Arrived 5000323
## 20 1993 Cancelled 59845
## 21 1993 Diverted 10333
## 22 1994 Arrived 5101202
## 23 1994 Cancelled 66740
## 24 1994 Diverted 12106
## 25 1995 Arrived 5225038
## 26 1995 Cancelled 91905
## 27 1995 Diverted 10492
## 28 1996 Arrived 5209326
## 29 1996 Cancelled 128536
## 30 1996 Diverted 14121
## 31 1997 Arrived 5301999
## 32 1997 Cancelled 97763
## 33 1997 Diverted 12081
## 34 1998 Arrived 5227051
## 35 1998 Cancelled 144509
## 36 1998 Diverted 13161
## 37 1999 Arrived 5360018
## 38 1999 Cancelled 154311
## 39 1999 Diverted 13555
## 40 2000 Arrived 5481303
## 41 2000 Cancelled 187490
## 42 2000 Diverted 14254
## 43 2001 Arrived 5723673
## 44 2001 Cancelled 231198
## 45 2001 Diverted 12909
## 46 2002 Arrived 5197860
## 47 2002 Cancelled 65143
## 48 2002 Diverted 8356
## 49 2003 Arrived 6375690
## 50 2003 Cancelled 101469
## 51 2003 Diverted 11381
## 52 2004 Arrived 6987729
## 53 2004 Cancelled 127757
## 54 2004 Diverted 13784
## 55 2005 Arrived 6992838
## 56 2005 Cancelled 133730
## 57 2005 Diverted 14028
## 58 2006 Arrived 7003802
## 59 2006 Cancelled 121934
## 60 2006 Diverted 16186
## 61 2007 Arrived 7275288
## 62 2007 Cancelled 160748
## 63 2007 Diverted 17179
## 64 2008 Arrived 2319121
## 65 2008 Cancelled 64442
## 66 2008 Diverted 5654
# Only flights that did not arrive
not_arrived = pd.DataFrame(r.all_flights).query("FlightStatus != 'Arrived'")
# not_arrived
plt.figure(figsize=[16,12])
base_color = sns.color_palette()[2]
sns.lineplot(data=not_arrived, x='Year', y='Flights', hue='FlightStatus')
plt.show()
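Since the raw counts are affected by how many flights were flown in total each year, a percentage view (sketched below, not part of the original output) may answer the question better:

all_flights_py = pd.DataFrame(r.all_flights)
# Share of each year's flights falling into each status category
all_flights_py['Pct'] = 100 * all_flights_py['Flights'] / \
    all_flights_py.groupby('Year')['Flights'].transform('sum')
not_arrived_pct = all_flights_py.query("FlightStatus != 'Arrived'")
plt.figure(figsize=[16, 12])
sns.lineplot(data=not_arrived_pct, x='Year', y='Pct', hue='FlightStatus')
plt.ylabel('Percentage of all flights (%)')
plt.show()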